Abstract:

Advanced  shipboard  radar  systems  will  be  required  to  detect,  track  and  classify  ballistic  missile  targets 
and  re-entry  vehicles,  as  well  as  perform  traditional  Anti-Air  Warfare  (AAW)  operations.  New  Open 
Architecture  Systems,  including  COTS  hardware  as  well  as  open  system  software,  are  required  to 
implement  the  necessary  algorithms  for  successful  missile  defense.  Lockheed  Martin  MS2  has  been 
developing  such  open  architecture  systems  for  the  next  generation  Aegis  Combat  Systems,  which  will 
include  the  embedded  equipment  and  computer  programs  that  are  necessary  for  an  effective  missile 
defense.  Lockheed  Martin  has  made  extensive  use  of  Open  Architecture  (OA)  software  and  industry 
standard  Application  Programming  Interfaces  (APIs)  in  order  to  provide  the  Navy  and  Missile  Defense 
Agency  with  efficient,  open  architecture  software  that  exhibits  unprecedented  Portability  across 
computing  platforms,  vendor  design  environments,  processor  architectures  and  technology  upgrades. 

As we move forward with development of a deployable shipboard missile defense system, it has become
clear that a state-of-the-art, C++ based object oriented design environment and signal processing API
library would be extremely beneficial in the development of open architecture application software, while
retaining the advantageous portability features provided by the original C-based VSIPL API. For this
reason, Lockheed Martin has been an active participant in development of a next generation C++ signal
and image processing API through the High Performance Embedded Computing Software Initiative
(HPEC-SI).

This  briefing  describes  an  effort  to  implement  advanced  Shipboard  Ballistic  Missile  Defense  (SBMD) 
application  algorithms  utilizing  HPEC-SI.  Shipboard  application  code,  previously  written  in  the  C 
programming  language  for  conventional  COTS  PowerPC-based  embedded  architectures,  is  being 
converted  by  Lockheed  Martin  MS2,  as  an  HPEC-SI  Demonstration,  to  run  under  the  HPEC-SI  API.  The 
C  code,  designed  to  run  in  a  C  environment,  will  be  converted  to  the  HPEC-SI  API  standard  to  run  under 
a  true  C++  Object  Oriented  environment,  and  will  eventually  take  advantage  of  the  HPEC-SI  parallel 
processing  features. 

Of particular interest in this conversion is a comparison of key DoD processing algorithms executed on a
conventional, embedded processing architecture using C and C application libraries, as compared with
execution in an embedded HPEC-SI processing environment. The goals of this effort were to:

• Demonstrate a critical embedded DoD BMD signal processing application using the HPEC-SI API
under development on the HPEC-SI initiative

• Compare the engineering development metrics, contrasting the conventional C API software
development environment with the C++ HPEC-SI Object Oriented software development
environment, and

• Compare relative code size and development cost with the C API

In this briefing, we describe the porting of several signal processing algorithms, originally developed
using C-based VSIPL, to the HPEC-SI VSIPL++ API under development. As part of this process, the
HPEC-SI community will receive valuable feedback regarding the HPEC-SI API implementation, including
the development process, development metrics, development environment issues and key library
functions. Eventually, the open architecture HPEC-SI VSIPL++ code developed for the Navy and MDA
will be ported to a tactical system for deployment on Aegis cruisers and destroyers.
HPEC Software Initiative (HPEC-SI) Goals

■ Develop software technologies for embedded parallel systems to address:

    ■ Portability

    ■ Productivity

    ■ Performance

■ Deliver quantifiable benefits


Current HPEC-SI Focus

Development of the VSIPL++ and Parallel VSIPL++ Standards

■ VSIPL++

    ■ A C++ API based on concepts from VSIPL (an existing, industry accepted standard for
    signal processing)

    ■ VSIPL++ allows us to take advantage of useful C++ features

■ Parallel VSIPL++ is an extension to VSIPL++ for multi-processor execution

VSIPL++ Development Process

■ Development of the VSIPL++ Reference Specification

■ Creation of a reference implementation of VSIPL++

■ Creation of demo applications
LOCKHEED MARTIN


■ Use CodeSourcery’s VSIPL++ reference implementation in a main-stream DoD
Digital Signal Processor Application

■ Utilize existing “real-world” tactical application: Synthetic WideBand (SWB)
Radar Mode. The original code was developed for the United States Navy and
MDA under contract for improved S-Band Discrimination. SWB is continuing to
be evolved by MDA for the Aegis BMD signal processor.

■ Identify areas for improved or expanded functionality and usability

Milestone 1: Successfully build VSIPL++ API (Unix, Linux, Mercury)

Milestone 2: Convert SWB Application to use VSIPL++ API (Unix, Linux)

Milestone 3: Port SWB Application to embedded platforms (Mercury, Sky)

Milestone 4: Application analysis; feedback & recommendations
VSIPL++ Standards - Development Loop

During development, there was a continuous loop of change
requests/feedback, and API updates and patches
Lockheed Martin Software Risk Reduction Issues

■ General mission system requirements

    ■ Maximum use of COTS equipment, software and commercial standards

    ■ Support high degree of software portability and vendor interoperability

■ Software Risk Issues

    ■ Real-time operation

        □ Latency

        □ Bandwidth

        □ Throughput

    ■ Portability and re-use

        □ Across architectures

        □ Across vendors

        □ With vendor upgrades

    ■ Real-time signal processor control

        □ System initialization

        □ Fault detection and isolation

        □ Redundancy and reconfiguration

    ■ Scalability to full tactical signal processor
Lockheed Martin Software Risk Reduction Efforts

■ Benchmarks on vendor systems (CSPI, Mercury, HP, Cray, Sky, etc.)

    ■ Communication latency/throughput

    ■ Signal processing functions (e.g., FFTs)

    ■ Applications

■ Use of and monitoring of industry standards

    ■ Communication standards: MPI, MPI-2, MPI/RT, Data Re-org, CORBA

    ■ Signal processing standards: VSIPL, VSIPL++

■ Technology refresh experience with operating system, network, and processor upgrades
(e.g., CSPI, SKY, Mercury)

■ Experience with VSIPL

    ■ Participation in standardization effort

    ■ Implementation experience

        □ Porting of VSIPL reference implementation to embedded systems

        □ C++ wrappers

    ■ Application modes developed

        □ Programmable Energy Search

        □ Programmable Energy Track

        □ Cancellation

        □ Moving Target Indicator

        □ Pulse Doppler

        □ Synthetic Wideband
Lockheed Martin Math Library Experience

Vendor Libraries: vendor supplied math libraries

    ■ Advantages

        □ Performance

    ■ Disadvantages

        □ Proprietary interface

        □ Portability

LM Proprietary C Wrappers: vendor libraries wrapped with #ifdef’s

    ■ Advantages

        □ Performance

        □ Portability

    ■ Disadvantages

        □ Proprietary interface

VSIPL Library: VSIPL standard

    ■ Advantages

        □ Performance

        □ Portability

        □ Standard interface

    ■ Disadvantages

        □ Verbose interface (higher % of management SLOCS)

LM Proprietary C++ Library: thin VSIPL-like C++ wrapper

    ■ Advantages

        □ Performance

        □ Portability

        □ Productivity (fewer SLOCS, better error handling)

    ■ Disadvantages

        □ Proprietary interface

        □ Partial implementation (didn’t wrap everything)

VSIPL++ Library: VSIPL++ standard

    ■ Advantages

        □ Standard interface

    ■ To Be Determined

        □ Performance

        □ Portability

        □ Productivity
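The "vendor libraries wrapped with #ifdef’s" approach can be pictured with a minimal sketch. The macro `VENDOR_A` and the routine `vendorA_vmul` are hypothetical placeholders, not names from any real vendor library; the point is only that application code calls one signature while the preprocessor selects the implementation, which gives portability at the cost of a proprietary interface.

```cpp
#include <cassert>
#include <cstddef>

#if defined(VENDOR_A)
// Hypothetical vendor routine; declared here only for illustration.
extern "C" void vendorA_vmul(const float *, const float *, float *, std::size_t);
#endif

// One portable signature; the implementation is chosen at compile time.
void wrappedVectorMultiply(const float *a, const float *b, float *out, std::size_t n) {
    assert(n == 0 || (a && b && out));
#if defined(VENDOR_A)
    vendorA_vmul(a, b, out, n);          // hypothetical optimized vendor call
#else
    for (std::size_t i = 0; i < n; ++i)  // portable fallback when no vendor macro is set
        out[i] = a[i] * b[i];
#endif
}
```

Built without any vendor macro defined, the fallback loop is what runs, so the same application source compiles on a development workstation and on an embedded target.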
Outline

■ Overview

■ Lockheed Martin Background and Experience

■ VSIPL++ Application

    ■ Overview

    ■ Application Interface

    ■ Processing Flow

    ■ Software Architecture

    ■ Algorithm Case Study

■ Conclusion
Application Overview

■ The Lockheed Martin team took an existing Synthetic Wideband
application, developed and targeted for Aegis BMD signal processor
implementation, and rewrote it to use and take advantage of the
VSIPL++ API

■  The  SWB  Application  achieves  a  high  bandwidth  resolution  using 
narrow  bandwidth  equipment,  for  the  purposes  of  extracting  target 
discriminant  information  from  the  processed  range  doppler  image 

■  Synthetic  Wideband  was  chosen  because: 

■  It  exercises  a  number  of  algorithms  and  operations  commonly  used 
in  our  embedded  signal  processing  applications 

■  Its  scope  is  small  enough  to  finish  the  task  completely,  yet  provide 
meaningful  feedback  in  a  timely  manner 

■  Main-stream  DoD  application 
Application Overview - Synthetic WideBand Processing

Synthetic Wideband Waveform Processing

By using “stepped” medium band pulses and specialized algorithms,
an effective “synthetic” wide band measurement can be obtained

1. Transmit and Receive Mediumband Pulses

2. ... Mediumband Pulses

3. Coherently Combine Mediumband Pulses to Obtain Synthetic
Wideband Response

[Figure: pulse responses plotted against Range]

■ Requires accurate knowledge of target motion over waveform
duration

■ Requires phase calibration as a function of mediumband pulse
center frequency
Application Interface

Inputs to the SWB Application

■ Calibration Data

■ Algorithm Control Parameters

■ Control & Radar Data

■ Hardware Mapping Information
(How application is mapped to processors)

Outputs of the SWB Application: Processing Results

■ Images

■ Features
Processing Flow

Software Architecture

Application “main”

    Ties together a set of tasks to build the overall application

Tasks

    Data-parallel code that can be mapped to a set of processors and/or strung together
    into a data flow. Tasks are responsible for:

    ■ Sending and/or receiving data

    ■ Processing the data (using the algorithms)

    ■ Reading the stimulus control data and passing any needed control parameters
    into the algorithms

    Task diagram labels: Input, Radar Data, CPI Task, Coherent Int Task, Output

Algorithms

    Library of higher-level, application-oriented math functions with VSIPL-like interface

    ■ Interface uses views for input/output

    ■ Algorithms never deal explicitly with data distribution issues

    Algorithm diagram labels: Coherent Integration, Interpolation, Pulse Compression,
    Range Walk Comp, Doppler Comp, Synthetic Upmix
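The split between tasks and algorithms can be illustrated with a small sketch. Everything named here (`View`, `ControlParams`, `applyGain`, `gainTask`) is hypothetical and chosen for illustration; it is not the SWB application's code. The algorithm sees only views and a control value, while the task owns the data movement and hands control parameters to the algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal non-owning view: the algorithm sees only data and length,
// never how the data is distributed across processors.
struct View {
    float *data;
    std::size_t length;
};

// Hypothetical control parameters a task would read from stimulus control data.
struct ControlParams {
    float gain;
};

// "Algorithm" layer: pure math on input/output views.
void applyGain(View in, View out, float gain) {
    assert(in.length == out.length);
    for (std::size_t i = 0; i < in.length; ++i) out.data[i] = gain * in.data[i];
}

// "Task" layer: receives a block of data, passes the control parameter into
// the algorithm, and returns the block that would be sent to the next task.
std::vector<float> gainTask(std::vector<float> received, const ControlParams &ctl) {
    std::vector<float> result(received.size());
    View in = {received.data(), received.size()};
    View out = {result.data(), result.size()};
    applyGain(in, out, ctl.gain);
    return result;
}
```

Because `applyGain` never learns where its views came from, the same algorithm could in principle be reused whether a task runs on one processor or is mapped across several.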
Algorithm Case Study Overview

■ Goal

    ■ Show how we reached some of our VSIPL++ conclusions by
    walking through the series of steps needed to convert a part of our
    application from VSIPL to VSIPL++

■ Algorithm

    ■ Starting point

        □ Simplified version of a pulse compression kernel

        □ Math: output = ifft( fft(input) * reference )

    ■ Add requirements

        □ Error handling

        □ Decimate input

        □ Support both single and double precision

        □ Port application to embedded system
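To make the kernel's math concrete before the API comparison, the following is a minimal, library-free sketch of output = ifft( fft(input) * reference ). It assumes the reference is already expressed in the frequency domain and uses a naive O(N^2) DFT in place of a real FFT; the names `dft` and `pulseCompressSketch` are illustrative and belong to neither VSIPL nor VSIPL++.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> Sample;
typedef std::vector<Sample> Signal;

// Naive O(N^2) discrete Fourier transform; sign = -1 forward, +1 inverse.
// Illustrative stand-in for a library FFT, not an optimized implementation.
Signal dft(const Signal &x, int sign, double scale) {
    const std::size_t n = x.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    Signal y(n);
    for (std::size_t k = 0; k < n; ++k) {
        Sample sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                sign * twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        y[k] = scale * sum;
    }
    return y;
}

// output = ifft( fft(input) * reference ), reference given per frequency bin.
Signal pulseCompressSketch(const Signal &input, const Signal &reference) {
    assert(input.size() == reference.size());
    Signal spectrum = dft(input, -1, 1.0);
    for (std::size_t k = 0; k < spectrum.size(); ++k) spectrum[k] *= reference[k];
    return dft(spectrum, +1, 1.0 / static_cast<double>(spectrum.size()));
}
```

The forward transform is unscaled and the inverse is scaled by 1/N, matching the 1.0 and 1.0/size scale factors passed to the FFT objects in the case study code. With an all-ones reference the output reproduces the input, which is a convenient sanity check.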
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

VSIPL version:

void pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ version:

void pulseCompress(const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size();

    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft(in), out );
}

Observations

VSIPL++ code has fewer SLOCS than VSIPL code
(5 VSIPL++ SLOCS vs. 13 VSIPL SLOCS)

VSIPL++ syntax is more complex than VSIPL syntax

■ Syntax for FFT object creation

■ Extra set of parentheses needed in defining
Domain argument for FFT objects

VSIPL code includes more management SLOCS

■ VSIPL code must explicitly manage temporaries

■ Must remember to free temporary objects and FFT
operators in VSIPL code

VSIPL++ code expresses core algorithm in fewer SLOCS

■ VSIPL++ code expresses algorithm in one line,
VSIPL code in three lines

■ Performance of VSIPL++ code may be better than
VSIPL code
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Catch any errors and propagate error status

VSIPL version:

int pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    int valid = 0;
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2) {
        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);
        valid = 1;
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);

    return valid;
}

VSIPL++ version:

void pulseCompress(const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size();

    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft(in), out );
}

Observations

VSIPL code additions are highlighted

■ No changes to VSIPL++ function due to VSIPL++
support for C++ exceptions

■ 5 VSIPL++ SLOCS vs. 17 VSIPL SLOCS

VSIPL behavior not defined by specification if there are
errors in FFT and vector multiplication calls

■ For example, if lengths of vector arguments are
unequal, implementation may core dump, stop
with error message, silently write past end of
vector memory, etc.

■ FFT and vector multiplication calls do not return
error codes
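The observation that the VSIPL++ function needs no changes for error handling rests on two general C++ mechanisms: exceptions propagate through code that does not mention them, and automatic objects are released during unwinding. The sketch below shows both using only the standard library. The function names are illustrative assumptions and the checked multiply is a stand-in, not a VSIPL++ call.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

typedef std::complex<float> Sample;

// Element-wise multiply that reports a length mismatch by throwing,
// rather than leaving the behavior undefined.
std::vector<Sample> multiplyChecked(const std::vector<Sample> &a,
                                    const std::vector<Sample> &b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("multiplyChecked: vector lengths differ");
    }
    std::vector<Sample> product(a.size());  // released automatically, even on a throw
    for (std::size_t i = 0; i < a.size(); ++i) product[i] = a[i] * b[i];
    return product;
}

// The kernel contains no error-handling code: an exception raised in
// multiplyChecked simply propagates through it to the caller.
std::vector<Sample> applyReference(const std::vector<Sample> &spectrum,
                                   const std::vector<Sample> &ref) {
    return multiplyChecked(spectrum, ref);
}

// A caller that wants a C-style status code can translate at one boundary.
int applyReferenceStatus(const std::vector<Sample> &spectrum,
                         const std::vector<Sample> &ref,
                         std::vector<Sample> &out) {
    try {
        out = applyReference(spectrum, ref);
        return 1;  // valid
    } catch (const std::exception &) {
        return 0;  // error reported as status
    }
}
```

Only the outermost caller decides how to report the failure, so intermediate kernels stay as short as their error-free versions.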
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Decimate input by N prior to first FFT

VSIPL version:

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_cvputstride_f(in, decimationFactor);
    vsip_cvputlength_f(in, size);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ version:

void pulseCompress(int decimationFactor,
                   const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size() / decimationFactor;

    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);

    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

SLOC count doesn’t change all that much for
VSIPL or VSIPL++ code

■ 2 changed lines for VSIPL

■ 3 changed lines for VSIPL++

■ 2 additional SLOCS for VSIPL

■ 1 additional SLOC for VSIPL++

VSIPL version of code has a side-effect

■ The input vector was modified and not
restored to original state

■ This type of side-effect was the cause of
many problems/bugs when we first
started working with VSIPL
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Decimate input by N prior to first FFT, no side-effects

VSIPL version:

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_stride savedStride = vsip_cvgetstride_f(in);
    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_cvputlength_f(in, size);
    vsip_cvputstride_f(in, decimationFactor);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvputlength_f(in, savedSize);
    vsip_cvputstride_f(in, savedStride);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ version:

void pulseCompress(int decimationFactor,
                   const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size() / decimationFactor;
    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);

    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

VSIPL code must save away the input vector
state prior to use and restore it before returning

Code size changes

■ VSIPL code requires 4 additional SLOCS

■ VSIPL++ code does not change from prior
version
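The reason the VSIPL++ version has no side-effect is that decimation is expressed as a new subview rather than by changing the stride and length stored in the caller's view. A minimal sketch of that idea, using a hypothetical read-only `StridedView` over `std::vector` (not the VSIPL++ view classes), is shown here; `decimate` plays a role loosely analogous to indexing with Domain<1>(0, decimationFactor, size).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only strided view over a vector. It stores its own
// offset/stride/length, so the underlying vector's state is never altered.
template <class T>
class StridedView {
public:
    StridedView(const std::vector<T> &data, std::size_t offset,
                std::size_t stride, std::size_t length)
        : data_(data), offset_(offset), stride_(stride), length_(length) {
        assert(stride > 0);
        assert(length == 0 || offset + (length - 1) * stride < data.size());
    }
    std::size_t size() const { return length_; }
    const T &operator[](std::size_t i) const { return data_[offset_ + i * stride_]; }

private:
    const std::vector<T> &data_;
    std::size_t offset_, stride_, length_;
};

// Decimate by N: every N-th element starting at index 0.
template <class T>
StridedView<T> decimate(const std::vector<T> &input, std::size_t decimationFactor) {
    assert(decimationFactor > 0);
    return StridedView<T>(input, 0, decimationFactor, input.size() / decimationFactor);
}
```

Since the view is a separate object, there is nothing to save and restore afterwards, which is where the four extra VSIPL SLOCS come from.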


Lockheed  Martin  Corporation 


Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Support both single and double precision floating point

VSIPL version (single precision):

void pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL version (double precision):

void pulseCompress(vsip_cvview_d *in, vsip_cvview_d *ref, vsip_cvview_d *out) {
    vsip_length size = vsip_cvgetlength_d(in);

    vsip_fft_d *forwardFft = vsip_ccfftop_create_d(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_d *inverseFft = vsip_ccfftop_create_d(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_d *tmpView1 = vsip_cvcreate_d(size, VSIP_MEM_NONE);
    vsip_cvview_d *tmpView2 = vsip_cvcreate_d(size, VSIP_MEM_NONE);

    vsip_ccfftop_d(forwardFft, in, tmpView1);
    vsip_cvmul_d(tmpView1, ref, tmpView2);
    vsip_ccfftop_d(inverseFft, tmpView2, out);

    vsip_cvalldestroy_d(tmpView1);
    vsip_cvalldestroy_d(tmpView2);
    vsip_fft_destroy_d(forwardFft);
    vsip_fft_destroy_d(inverseFft);
}

VSIPL++ version:

template<class T, class U, class V> void pulseCompress(const T &in, const U &ref, const V &out) {
    int size = in.size();

    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft(in), out );
}

Observations

VSIPL++ code has same SLOC count as original

■ Uses C++ templates (3 lines changed)

■ Syntax is slightly more complicated

VSIPL code doubles in size

■ Function must first be duplicated

■ Small changes must then be made to code
(i.e., changing _f to _d)
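The template mechanism that keeps the VSIPL++ SLOC count flat can be shown in isolation. In this sketch, `multiplyByReference` is a hypothetical helper written against `std::vector` rather than VSIPL++ views; it picks up the element type through `value_type` in the same way the case study code uses `typename T::value_type`, so one definition serves both precisions.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// One generic definition; the compiler instantiates it separately for
// std::complex<float> and std::complex<double> containers.
template <class Vec>
Vec multiplyByReference(const Vec &spectrum, const Vec &ref) {
    typedef typename Vec::value_type value_type;
    assert(spectrum.size() == ref.size());
    Vec out(spectrum.size());
    for (std::size_t i = 0; i < spectrum.size(); ++i) {
        const value_type product = spectrum[i] * ref[i];
        out[i] = product;
    }
    return out;
}
```

Calling it with a vector of `std::complex<float>` or of `std::complex<double>` requires no source changes, in contrast to duplicating a C function and renaming every `_f` suffix to `_d`.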
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Support all previously stated requirements

VSIPL version (single precision):

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_stride savedStride = vsip_cvgetstride_f(in);

    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_f(in, size);
        vsip_cvputstride_f(in, decimationFactor);

        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);

        vsip_cvputlength_f(in, savedSize);
        vsip_cvputstride_f(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);
}

VSIPL version (double precision):

void pulseCompress(int decimationFactor, vsip_cvview_d *in, vsip_cvview_d *ref, vsip_cvview_d *out) {
    vsip_length savedSize = vsip_cvgetlength_d(in);
    vsip_stride savedStride = vsip_cvgetstride_d(in);

    vsip_length size = vsip_cvgetlength_d(in) / decimationFactor;

    vsip_fft_d *forwardFft = vsip_ccfftop_create_d(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_d *inverseFft = vsip_ccfftop_create_d(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_d *tmpView1 = vsip_cvcreate_d(size, VSIP_MEM_NONE);
    vsip_cvview_d *tmpView2 = vsip_cvcreate_d(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_d(in, size);
        vsip_cvputstride_d(in, decimationFactor);

        vsip_ccfftop_d(forwardFft, in, tmpView1);
        vsip_cvmul_d(tmpView1, ref, tmpView2);
        vsip_ccfftop_d(inverseFft, tmpView2, out);

        vsip_cvputlength_d(in, savedSize);
        vsip_cvputstride_d(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_d(tmpView1);
    if (tmpView2) vsip_cvalldestroy_d(tmpView2);
    if (forwardFft) vsip_fft_destroy_d(forwardFft);
    if (inverseFft) vsip_fft_destroy_d(inverseFft);
}

VSIPL++ version:

template<class T, class U, class V> void pulseCompress(int decimationFactor, const T &in, const U &ref, const V &out) {
    int size = in.size() / decimationFactor;

    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);

    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

Final SLOC count

■ VSIPL++ - 6 SLOCS

■ VSIPL - 40 SLOCS
(20 each for double and single precision versions)
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Port application to high performance embedded systems

VSIPL version (single precision):

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_stride savedStride = vsip_cvgetstride_f(in);

    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_f(in, size);
        vsip_cvputstride_f(in, decimationFactor);

        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);

        vsip_cvputlength_f(in, savedSize);
        vsip_cvputstride_f(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);
}

VSIPL++ version:

template<class T, class U, class V> void pulseCompress(int decimationFactor, const T &in, const U &ref, const V &out) {
    int size = in.size() / decimationFactor;

    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);

    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD>
        forwardFft((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE>
        inverseFft((vsip::Domain<1>(size)), 1.0/size);

    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

Port to embedded Mercury system

■ Hardware: Mercury VME chassis with PowerPC compute nodes

■ Software: Mercury beta release of MCOE 6.0 with Linux operating
system. Mercury provided us with instructions for using the GNU g++
compiler

■ No lines of application code had to be changed

Port to embedded Sky system

■ Hardware: Sky VME chassis with PowerPC compute nodes

■ Software: Sky provided us with a modified version of their
standard compiler (added a GNU g++ based front-end)

■ No lines of application code had to be changed

Future availability of C++ with support for C++ standard

■ Improved C++ support is in Sky and Mercury product roadmaps

■ Support for C++ standard appears to be improving industry wide
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Outline 


LOCKHEED  M A 


■  Overview 

■  Lockheed  Martin  Background  and  Experience 

■  VSIPL++  Application 

■  Overview 

■  Application  Interface 

■  Processing  Flow 

■  Software  Architecture 

■  Algorithm  Case  Study 

■  Conclusion 
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Lockheed  Martin  Math  Library 
Experience 


LOCKHEED  MARTIN 


Vendor  supplied  math  libraries 

■  Advantages 

□  Performance 

■  Disadvantages 

□  Proprietary  Interface 

_ □  Portability _ 


VSIPL  standard 
■  Advantages 

□  Performance 

□  Portability 

□  Standard  interface 
Disadvantages 

□  Verbose  interface 


(higher  %  of  management  SLOCS) 


VSIPL++  standard 

■  Advantages 

□  Standard 
interface 

■  To  Be  Determined 

□  Performance 

□  Portability 

□  Productivity 


[Diagram boxes: Vendor Libraries; LM Proprietary C Wrappers; VSIPL Library; LM Proprietary C++ Library; VSIPL++ Library]
Vendor libraries wrapped with #ifdef’s

■  Advantages 

□  Performance 

□  Portability 

■  Disadvantages 

□ Proprietary interface
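
The #ifdef approach can be sketched as follows. This is only an illustration under stated assumptions: the macros USE_VENDOR_A and USE_VENDOR_B, the function name lmVectorMultiply, and the vendor calls mentioned in the comments are hypothetical and do not come from the slides. Only the portable fallback branch contains real code.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Element-wise complex multiply behind a single in-house interface.
// USE_VENDOR_A / USE_VENDOR_B and the calls named in the comments are
// hypothetical placeholders, not real vendor APIs.
void lmVectorMultiply(const std::vector<cfloat>& a,
                      const std::vector<cfloat>& b,
                      std::vector<cfloat>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
#if defined(USE_VENDOR_A)
    // A real build would call vendor A's optimized routine here (hypothetical).
#elif defined(USE_VENDOR_B)
    // A real build would call vendor B's optimized routine here (hypothetical).
#else
    // Portable fallback, compiled when no vendor library is selected.
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] * b[i];
    }
#endif
}
```

Application code calls only lmVectorMultiply, so it ports unchanged, but the interface itself remains proprietary, which is the disadvantage listed above.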


Thin  VSIPL-like  C++  wrapper 

■  Advantages 

□  Performance 

□  Portability 

□ Productivity (fewer SLOCS, better error handling)

■  Disadvantages 

□  Proprietary  interface 

□ Partial implementation (didn’t wrap everything)
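
A thin C++ wrapper of this kind might look like the sketch below. It assumes a C-style library that hands out handles through create/destroy calls; the mock_view type and the mock_view_create / mock_view_destroy functions are hypothetical stand-ins written for this example, not the actual Lockheed Martin library or VSIPL itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Stand-in for a C-style library handle (hypothetical, for illustration only).
struct mock_view { float* data; std::size_t length; };

mock_view* mock_view_create(std::size_t length) {
    if (length == 0) return nullptr;               // creation can fail
    mock_view* v = static_cast<mock_view*>(std::malloc(sizeof(mock_view)));
    if (!v) return nullptr;
    v->data = static_cast<float*>(std::calloc(length, sizeof(float)));
    if (!v->data) { std::free(v); return nullptr; }
    v->length = length;
    return v;
}

void mock_view_destroy(mock_view* v) {
    if (v) { std::free(v->data); std::free(v); }
}

// Thin wrapper: the constructor turns a null handle into an exception and
// the destructor releases the handle, so callers write no cleanup code.
class View {
public:
    explicit View(std::size_t length) : handle_(mock_view_create(length)) {
        if (!handle_) throw std::runtime_error("view creation failed");
    }
    ~View() { mock_view_destroy(handle_); }
    View(const View&) = delete;
    View& operator=(const View&) = delete;

    std::size_t size() const { return handle_->length; }
    float& operator[](std::size_t i) {
        assert(i < handle_->length);
        return handle_->data[i];
    }
private:
    mock_view* handle_;
};
```

The management code (null checks and destroy calls) lives once in the wrapper rather than in every function, which is one plausible source of the "fewer SLOCS, better error handling" advantage noted above.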


Lockheed  Martin  Corporation 
Conclusion

■ Standard interface
■ Productivity
    □ A VSIPL++ user’s guide, including a set of examples, would have been helpful
    □ The learning curve for VSIPL++ can be somewhat steep initially
    □ Fewer lines of code are needed to express mathematical algorithms in VSIPL++
    □ Fewer maintenance SLOCS are required for VSIPL++ programs
■ Portability
    □ VSIPL++ is portable to platforms with support for standard C++
    □ Most vendors have plans to support advanced C++ features required by VSIPL++
■ Performance
    □ VSIPL++ provides greater opportunity for performance optimization
    □ A performance-oriented implementation is not currently available to verify performance

Lockheed Martin goals are well aligned with VSIPL++ goals

Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

VSIPL code

void pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ code

void pulseCompress(const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size();
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft(in), out );
}

Observations

VSIPL++ code has fewer SLOCS than VSIPL code (5 VSIPL++ SLOCS vs. 13 VSIPL SLOCS)

VSIPL++ syntax is more complex than VSIPL syntax
■ Syntax for FFT object creation
■ Extra set of parentheses needed in defining Domain argument for FFT objects

VSIPL code includes more management SLOCS
■ VSIPL code must explicitly manage temporaries
■ Must remember to free temporary objects and FFT operators in VSIPL code

VSIPL++ code expresses core algorithm in fewer SLOCS
■ VSIPL++ code expresses algorithm in one line, VSIPL code in three lines
■ Performance of VSIPL++ code may be better than VSIPL code
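
For readers without either library at hand, the kernel output = ifft( fft(input) * ref ) can be sketched in standard C++ alone. This is not the slide's implementation: a naive O(n^2) DFT is swapped in for the FFT, the reference is assumed to be supplied in the frequency domain, and the names cvec, dft and pulseCompressReference are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) DFT; sign = -1 gives the forward transform, +1 the inverse
// (unscaled). A real application would call an FFT library instead.
cvec dft(const cvec& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    cvec y(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi *
                static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        y[k] = sum;
    }
    return y;
}

// output = ifft( fft(input) * ref ), with ref assumed to be in the frequency
// domain and the 1/n scale applied on the inverse transform.
cvec pulseCompressReference(const cvec& in, const cvec& ref) {
    assert(in.size() == ref.size());
    cvec spectrum = dft(in, -1);
    for (std::size_t k = 0; k < spectrum.size(); ++k) spectrum[k] *= ref[k];
    cvec out = dft(spectrum, +1);
    for (auto& v : out) v /= static_cast<double>(out.size());
    return out;
}
```

With a reference of all ones the function returns the input unchanged (up to rounding), which is a convenient sanity check for any implementation of the kernel.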




Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Catch any errors and propagate error status

VSIPL code

int pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    int valid = 0;
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2) {
        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);
        valid = 1;
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);

    return valid;
}

VSIPL++ code

void pulseCompress(const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size();
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft(in), out );
}

Observations

VSIPL code additions are highlighted on the original slide (the int return type, the valid flag, the null checks, and the return statement)
■ No changes to VSIPL++ function due to VSIPL++ support for C++ exceptions
■ 5 VSIPL++ SLOCS vs. 17 VSIPL SLOCS
■ VSIPL behavior is not defined by the specification if there are errors in FFT and vector multiplication calls
    □ For example, if the lengths of the vector arguments are unequal, an implementation may core dump, stop with an error message, silently write past the end of vector memory, etc.
    □ FFT and vector multiplication calls do not return error codes
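
The point about exceptions can be illustrated with a small standard C++ sketch. It makes no claim about which errors a VSIPL++ implementation detects or which exception types it throws; multiplyInto, scaleTwice and runWithStatus are hypothetical names, and the length check stands in for whatever validation a library might perform.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

using cvec = std::vector<std::complex<float>>;

// Library-level operation: reports a length mismatch by throwing instead of
// returning an error code. (Stand-in for a library call; hypothetical.)
void multiplyInto(const cvec& a, const cvec& b, cvec& out) {
    if (a.size() != b.size() || a.size() != out.size()) {
        throw std::length_error("vector lengths differ");
    }
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i];
}

// Algorithm code: no status variable, no null checks, no cleanup code.
// The temporary is released automatically even if multiplyInto throws.
void scaleTwice(const cvec& in, const cvec& ref, cvec& out) {
    cvec tmp(in.size());
    multiplyInto(in, ref, tmp);
    multiplyInto(tmp, ref, out);
}

// Only the caller that needs a status code converts the exception into one.
int runWithStatus(const cvec& in, const cvec& ref, cvec& out) {
    try {
        scaleTwice(in, ref, out);
        return 1;   // valid
    } catch (const std::exception&) {
        return 0;   // error status propagated to the caller
    }
}
```

The algorithm function stays the same whether or not the caller cares about errors, which mirrors the observation that the VSIPL++ function needed no changes.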
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Decimate input by N prior to first FFT

VSIPL code

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_cvputstride_f(in, decimationFactor);
    vsip_cvputlength_f(in, size);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ code

void pulseCompress(int decimationFactor, const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size() / decimationFactor;
    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

SLOC count doesn’t change all that much for VSIPL or VSIPL++ code
■ 2 changed lines for VSIPL
■ 3 changed lines for VSIPL++
■ 2 additional SLOCS for VSIPL
■ 1 additional SLOC for VSIPL++

VSIPL version of code has a side-effect
■ The input vector was modified and not restored to its original state
■ This type of side-effect was the cause of many problems/bugs when we first started working with VSIPL
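
Why indexing with a domain avoids the side-effect can be seen in a small standard C++ sketch. StridedView and decimate are hypothetical helpers written for this example, loosely analogous to indexing a view with a (start, stride, length) domain; they are not VSIPL++ types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read-only strided window onto a vector. Hypothetical helper type.
template <class T>
class StridedView {
public:
    StridedView(const std::vector<T>& data, std::size_t start,
                std::size_t stride, std::size_t length)
        : data_(data), start_(start), stride_(stride), length_(length) {
        assert(stride > 0);
        assert(length == 0 || start + (length - 1) * stride < data.size());
    }
    std::size_t size() const { return length_; }
    const T& operator[](std::size_t i) const {
        assert(i < length_);
        return data_[start_ + i * stride_];
    }
private:
    const std::vector<T>& data_;   // the source is never modified
    std::size_t start_, stride_, length_;
};

// Decimate by N: keep every N-th sample, starting at index 0.
template <class T>
StridedView<T> decimate(const std::vector<T>& in, std::size_t factor) {
    assert(factor > 0);
    return StridedView<T>(in, 0, factor, in.size() / factor);
}
```

The decimated window is a separate object, so the caller's vector keeps its original length and stride, in contrast to changing those attributes on the input view in place.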
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Decimate input by N prior to first FFT, no side-effects

VSIPL code

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_length savedStride = vsip_cvgetstride_f(in);
    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);
    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_cvputlength_f(in, size);
    vsip_cvputstride_f(in, decimationFactor);
    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);
    vsip_cvputlength_f(in, savedSize);
    vsip_cvputstride_f(in, savedStride);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL++ code

void pulseCompress(int decimationFactor, const vsip::Vector< std::complex<float> > &in,
                   const vsip::Vector< std::complex<float> > &ref,
                   const vsip::Vector< std::complex<float> > &out) {
    int size = in.size() / decimationFactor;
    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1.0);
    vsip::FFT<vsip::Vector, vsip::cscalar_f, vsip::cscalar_f, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}

Observations

VSIPL code must save away the input vector state prior to use and restore it before returning

Code size changes
■ VSIPL code requires 4 additional SLOCS
■ VSIPL++ code does not change from prior version
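
The save-and-restore pattern that the VSIPL version writes by hand can also be packaged once in a scope guard. The sketch below is an assumption-laden illustration: MockView, ViewStateGuard and processDecimated are hypothetical names, and MockView merely imitates a view whose length and stride can be changed in place.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a C-style view whose length and stride can be
// changed in place (hypothetical; not the VSIPL types).
struct MockView {
    std::size_t length;
    std::size_t stride;
};

// Saves the view attributes on construction and restores them on scope
// exit, including early returns and exceptions.
class ViewStateGuard {
public:
    explicit ViewStateGuard(MockView& v)
        : view_(v), savedLength_(v.length), savedStride_(v.stride) {}
    ~ViewStateGuard() {
        view_.length = savedLength_;
        view_.stride = savedStride_;
    }
    ViewStateGuard(const ViewStateGuard&) = delete;
    ViewStateGuard& operator=(const ViewStateGuard&) = delete;
private:
    MockView& view_;
    std::size_t savedLength_;
    std::size_t savedStride_;
};

// Temporarily reinterprets the view as a decimated one and returns the
// decimated length; the caller's view is unchanged afterwards.
std::size_t processDecimated(MockView& in, std::size_t factor) {
    assert(factor > 0);
    ViewStateGuard guard(in);
    in.length = in.length / factor;
    in.stride = in.stride * factor;
    return in.length;   // guard restores length and stride after this
}
```

A guard like this removes the risk of forgetting the restore calls, although it still costs extra code compared with a design in which the input is never modified.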




Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Support both single and double precision floating point

VSIPL code (single precision)

void pulseCompress(vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length size = vsip_cvgetlength_f(in);

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    vsip_ccfftop_f(forwardFft, in, tmpView1);
    vsip_cvmul_f(tmpView1, ref, tmpView2);
    vsip_ccfftop_f(inverseFft, tmpView2, out);

    vsip_cvalldestroy_f(tmpView1);
    vsip_cvalldestroy_f(tmpView2);
    vsip_fft_destroy_f(forwardFft);
    vsip_fft_destroy_f(inverseFft);
}

VSIPL code (double precision)

void pulseCompress(vsip_cvview_d *in, vsip_cvview_d *ref, vsip_cvview_d *out) {
    vsip_length size = vsip_cvgetlength_d(in);

    vsip_fft_d *forwardFft = vsip_ccfftop_create_d(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_d *inverseFft = vsip_ccfftop_create_d(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_d *tmpView1 = vsip_cvcreate_d(size, VSIP_MEM_NONE);
    vsip_cvview_d *tmpView2 = vsip_cvcreate_d(size, VSIP_MEM_NONE);

    vsip_ccfftop_d(forwardFft, in, tmpView1);
    vsip_cvmul_d(tmpView1, ref, tmpView2);
    vsip_ccfftop_d(inverseFft, tmpView2, out);

    vsip_cvalldestroy_d(tmpView1);
    vsip_cvalldestroy_d(tmpView2);
    vsip_fft_destroy_d(forwardFft);
    vsip_fft_destroy_d(inverseFft);
}

Observations

VSIPL++ code has same SLOC count as original
■ Uses C++ templates (3 lines changed)
■ Syntax is slightly more complicated

VSIPL code doubles in size
■ Function must first be duplicated
■ Small changes must then be made to code (i.e., changing _f to _d)

VSIPL++ code

template<class T, class U, class V> void pulseCompress(const T &in, const U &ref, const V &out) {
    int size = in.size();
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft(in), out );
}
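
The template technique itself can be tried in isolation with standard containers. The sketch below assumes nothing about VSIPL++; multiplySpectra is a hypothetical function that shows one definition serving both precisions, with typename In::value_type used in the same way as T::value_type in the slide's code.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <type_traits>
#include <vector>

// One definition serves every element type; value_type plays the same role
// as T::value_type in the slide's template.
template <class In, class Ref, class Out>
void multiplySpectra(const In& in, const Ref& ref, Out& out) {
    using value_type = typename In::value_type;
    static_assert(std::is_same<value_type, typename Out::value_type>::value,
                  "input and output must share an element type");
    assert(in.size() == ref.size());
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * ref[i];
    }
}
```

Calling it with vectors of std::complex<float> or std::complex<double> instantiates a separate version for each precision from the same source, so no duplicated function with renamed suffixes is needed.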
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Support all previously stated requirements

VSIPL code (single precision)

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_length savedStride = vsip_cvgetstride_f(in);

    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_f(in, size);
        vsip_cvputstride_f(in, decimationFactor);

        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);

        vsip_cvputlength_f(in, savedSize);
        vsip_cvputstride_f(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);
}

VSIPL code (double precision)

void pulseCompress(int decimationFactor, vsip_cvview_d *in, vsip_cvview_d *ref, vsip_cvview_d *out) {
    vsip_length savedSize = vsip_cvgetlength_d(in);
    vsip_length savedStride = vsip_cvgetstride_d(in);

    vsip_length size = vsip_cvgetlength_d(in) / decimationFactor;

    vsip_fft_d *forwardFft = vsip_ccfftop_create_d(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_d *inverseFft = vsip_ccfftop_create_d(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_d *tmpView1 = vsip_cvcreate_d(size, VSIP_MEM_NONE);
    vsip_cvview_d *tmpView2 = vsip_cvcreate_d(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_d(in, size);
        vsip_cvputstride_d(in, decimationFactor);

        vsip_ccfftop_d(forwardFft, in, tmpView1);
        vsip_cvmul_d(tmpView1, ref, tmpView2);
        vsip_ccfftop_d(inverseFft, tmpView2, out);

        vsip_cvputlength_d(in, savedSize);
        vsip_cvputstride_d(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_d(tmpView1);
    if (tmpView2) vsip_cvalldestroy_d(tmpView2);
    if (forwardFft) vsip_fft_destroy_d(forwardFft);
    if (inverseFft) vsip_fft_destroy_d(inverseFft);
}

Observations

Final SLOC count
■ VSIPL++: 6 SLOCS
■ VSIPL: 40 SLOCS (20 each for double and single precision versions)

VSIPL++ code

template<class T, class U, class V> void pulseCompress(int decimationFactor, const T &in, const U &ref, const V &out) {
    int size = in.size() / decimationFactor;
    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}
Algorithm Case Study

Simple pulse compression kernel

Main Algorithm: output = ifft( fft(input) * ref )

Additional requirement: Port application to high performance embedded systems

VSIPL code

void pulseCompress(int decimationFactor, vsip_cvview_f *in, vsip_cvview_f *ref, vsip_cvview_f *out) {
    vsip_length savedSize = vsip_cvgetlength_f(in);
    vsip_length savedStride = vsip_cvgetstride_f(in);

    vsip_length size = vsip_cvgetlength_f(in) / decimationFactor;

    vsip_fft_f *forwardFft = vsip_ccfftop_create_f(size, 1.0, VSIP_FFT_FWD, 1, VSIP_ALG_SPACE);
    vsip_fft_f *inverseFft = vsip_ccfftop_create_f(size, 1.0/size, VSIP_FFT_INV, 1, VSIP_ALG_SPACE);

    vsip_cvview_f *tmpView1 = vsip_cvcreate_f(size, VSIP_MEM_NONE);
    vsip_cvview_f *tmpView2 = vsip_cvcreate_f(size, VSIP_MEM_NONE);

    if (forwardFft && inverseFft && tmpView1 && tmpView2)
    {
        vsip_cvputlength_f(in, size);
        vsip_cvputstride_f(in, decimationFactor);

        vsip_ccfftop_f(forwardFft, in, tmpView1);
        vsip_cvmul_f(tmpView1, ref, tmpView2);
        vsip_ccfftop_f(inverseFft, tmpView2, out);

        vsip_cvputlength_f(in, savedSize);
        vsip_cvputstride_f(in, savedStride);
    }

    if (tmpView1) vsip_cvalldestroy_f(tmpView1);
    if (tmpView2) vsip_cvalldestroy_f(tmpView2);
    if (forwardFft) vsip_fft_destroy_f(forwardFft);
    if (inverseFft) vsip_fft_destroy_f(inverseFft);
}

Observations

Port to embedded Mercury system
■ Hardware: Mercury VME chassis with PowerPC compute nodes
■ Software: Mercury beta release of MCOE 6.0 with Linux operating system. Mercury provided us with instructions for using the GNU g++ compiler
■ No lines of application code had to be changed

Port to embedded Sky system
■ Hardware: Sky VME chassis with PowerPC compute nodes
■ Software: Sky provided us with a modified version of their standard compiler (added a GNU g++ based front-end)
■ No lines of application code had to be changed

Future availability of C++ with support for C++ standard
■ Improved C++ support is in Sky and Mercury product roadmaps
■ Support for C++ standard appears to be improving industry wide

VSIPL++ code

template<class T, class U, class V> void pulseCompress(int decimationFactor, const T &in, const U &ref, const V &out) {
    int size = in.size() / decimationFactor;
    vsip::Domain<1> decimatedDomain(0, decimationFactor, size);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_FWD> forwardFft ((vsip::Domain<1>(size)), 1);
    vsip::FFT<vsip::Vector, typename T::value_type, typename V::value_type, vsip::FFT_INV, 0, vsip::SINGLE, vsip::BY_REFERENCE> inverseFft ((vsip::Domain<1>(size)), 1.0/size);
    inverseFft( ref * forwardFft( in(decimatedDomain) ), out );
}
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